Sains Malaysiana 54(8)(2025): 2087-2097

http://doi.org/10.17576/jsm-2025-5408-17

 

Improved Robust Principal Component Analysis based on Minimum Regularized Covariance Determinant for the Detection of High Leverage Points in High Dimensional Data

(Penambahbaikan Analisis Komponen Utama berdasarkan Penentu Kovarian Teratur Minimum bagi Mengecam Titik Tuasan Tinggi untuk Data Dimensi Tinggi)

 

HABSHAH MIDI1,2,*, JAAZ SUHAIZA1,3, MOHD ASLAM1,2, HANI SYAHIDA2 & EMI AMIELDA3

 

1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

2Department of Mathematics & Statistics, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

3Faculty of Computing & Multimedia, Universiti Poly-Tech Malaysia, 56100 Cheras, Kuala Lumpur, Malaysia

 

Received: 22 April 2024/Accepted: 13 March 2025

 

Abstract

This paper presents an extension of robust principal component analysis (ROBPCA), denoted IRPCA, that improves the accuracy of detecting high leverage points (HLPs) in high dimensional data (HDD). IRPCA first applies principal component analysis (PCA) to reduce the dimension of the data set, and then obtains robust location and scatter estimates of the PC scores based on the Minimum Regularized Covariance Determinant (MRCD). Instead of the robust score distance that ROBPCA uses to detect HLPs, the proposed IRPCA uses the robust Mahalanobis distance (RMD). The performance of IRPCA is compared with that of ROBPCA and of the Minimum Regularized Covariance Determinant and PCA-based method (MRCD-PCA) for identifying HLPs in HDD. The results indicate that all three methods detect HLPs very successfully, with no masking effect. However, ROBPCA suffers from serious swamping when HLPs make up less than 30% of the observations. The proposed IRPCA and MRCD-PCA perform similarly, with a very small swamping effect. The MRCD-PCA algorithm, however, is quite cumbersome and requires a longer running time. The attractive feature of IRPCA is that its algorithm is simpler and very fast.
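The three stages named in the abstract (PCA scores, robust location and scatter, robust Mahalanobis distance) can be sketched as below. This is a minimal illustration, not the authors' implementation. The function names, the number of retained components `k`, the fixed regularization weight `rho` with a scaled identity target, the subset fraction `h_frac`, and the median + 3 MAD cutoff are all assumptions made for the example. In particular, `robust_location_scatter` is a simplified regularized concentration-step estimator standing in for MRCD, not the full MRCD algorithm.

```python
import numpy as np

def pca_scores(X, k):
    """Project the centred data onto its first k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def sq_dist(Z, mu, K):
    """Squared Mahalanobis distance of every row of Z from mu under scatter K."""
    diff = Z - mu
    return np.sum((diff @ np.linalg.inv(K)) * diff, axis=1)

def robust_location_scatter(Z, h_frac=0.75, rho=0.05, max_iter=50):
    """Simplified stand-in for MRCD (assumption): concentration steps on an
    h-subset, with the subset covariance shrunk towards a scaled identity."""
    n, k = Z.shape
    h = int(np.ceil(h_frac * n))
    # Start from the h points closest to the coordinate-wise median.
    med = np.median(Z, axis=0)
    mad = np.median(np.abs(Z - med), axis=0)
    subset = np.argsort(np.sum(((Z - med) / mad) ** 2, axis=1))[:h]
    for _ in range(max_iter):
        mu = Z[subset].mean(axis=0)
        S = np.cov(Z[subset], rowvar=False)
        K = rho * (np.trace(S) / k) * np.eye(k) + (1 - rho) * S
        new_subset = np.argsort(sq_dist(Z, mu, K))[:h]
        if np.array_equal(np.sort(new_subset), np.sort(subset)):
            break  # the h-subset no longer changes
        subset = new_subset
    return mu, K

def detect_hlps(X, k=3):
    """Flag rows whose robust Mahalanobis distance exceeds median + 3 MAD
    (the cutoff rule is an assumption, not taken from the paper)."""
    Z = pca_scores(X, k)
    mu, K = robust_location_scatter(Z)
    rmd = np.sqrt(sq_dist(Z, mu, K))
    med = np.median(rmd)
    cutoff = med + 3 * np.median(np.abs(rmd - med))
    return np.flatnonzero(rmd > cutoff), rmd

# Hypothetical data: n = 100 observations, p = 500 variables,
# with the first 10 rows shifted to act as high leverage points.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))
X[:10] += 5.0
flagged, rmd = detect_hlps(X)
print("flagged rows:", flagged)
```

In this toy setting, masking would show up as planted rows missing from `flagged`, and swamping as clean rows appearing in it.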

Keywords: High leverage point; minimum regularized covariance determinant; principal component analysis; robust Mahalanobis distance

 

Abstract (translated from the Malay)

This paper presents an extension of robust principal component analysis (ROBPCA), denoted IRPCA, to improve the accuracy of identifying high leverage points (HLPs) in high dimensional data (HDD). IRPCA uses principal component analysis (PCA) to reduce the dimension of the data set, after which location and scale estimators of the PC scores are computed based on the Minimum Regularized Covariance Determinant (MRCD). Rather than using the robust score distance to identify HLPs as ROBPCA does, the proposed IRPCA uses the robust Mahalanobis distance (RMD). The performance of the proposed IRPCA is compared with that of ROBPCA and of the MRCD and PCA based method (MRCD-PCA) for identifying HLPs in HDD. The results show that all three methods are very successful in detecting HLPs, with no masking effect. However, ROBPCA suffers from a serious swamping effect when HLPs make up less than 30%. The proposed IRPCA and MRCD-PCA perform similarly, with a very small swamping effect. Nevertheless, the MRCD-PCA algorithm is rather complicated and requires a long running time. The attractive feature of IRPCA is that it offers a simple algorithm and a short computation time.

Keywords: High leverage point; minimum regularized covariance determinant; principal component analysis; robust Mahalanobis distance

 


 

*Corresponding author; email: habshah@upm.edu.my

 

 

 

 

 

 

 

           
